... members are creating new ways to evaluate content
of curricula, methods of teaching and the multiple
effects of both on students. The CENTER is unique because
of access to Southern California's elementary, secondary
and higher schools of diverse socio-economic levels
and cultural backgrounds. Three major aspects of the
program are:

Variables - Research in this area
is concerned with identifying and evaluating
the effects of instructional variables, and with
the development of conceptual models, learning 
theory and theory of instruction. The research 
involves the experimental study of the effects of 
differences in instruction as they may interact 
with individual differences among students. 



...ences in community and school environments and the
interactions of both with instructional programs.
It will also involve evaluating variations in student
and teacher characteristics and administrative
organization.

Criterion Measures - Research in this field is concerned
with creating a new conceptualization of evaluation
of instruction and in developing new instruments
to evaluate knowledge acquired in school by
measuring observable changes in cognitive, affective
and physiological behavior. It will also involve
evaluating the cost-effectiveness of instructional
programs.

The Implications and Use of Cloze Procedure
in the Evaluation of Instructional Programs 

John R. Bormuth 



One purpose of this paper is to examine the utility
of the cloze readability procedure as a device for evaluating
the effectiveness of instructional programs. The cloze
readability procedure consists of a set of rules for selecting
samples of verbal text from written instructional materials
and for making, administering, scoring, and interpreting
cloze tests made from those samples. In its essential form,
the cloze readability procedure purports to be no more than
a method for determining the extent to which students
understand the instruction they receive from the written verbal
material.

Methods of the type represented by the cloze procedure 
are presently essential to the process of evaluating instruc- 
tional programs. Evaluations which include only a measure of 
the outcomes of a program and a judgment of their worth ig- 
nore the fact that the knowledge taught in an instructional 
program is selected in competition with other knowledge that 
is also valued. One of the most painful realities of
curriculum construction is the fact that much valued knowledge
must be excluded because there are not sufficient time, money, 
and other resources to permit its inclusion. For this reason, 
the evaluation of an instructional program cannot be consid- 
ered complete unless it assesses the efficiency of each of 
the components of the instructional program. The cloze
readability procedure has been developed to provide information
of this kind.

This is not to say that the cloze readability procedure 
or any procedure in which materials are tested directly on 
the students represents the ideal approach to assessing the
... the evaluator an indication of how much learning results from
the exposure of students to the materials, but give him lit- 
tle information about how the various features of the mate- 
rials influenced that learning. If a mature science of in- 
struction existed, it would be possible, merely from an ex- 
amination of the features of the instructional materials, to 
calculate the kind and amount of influence any given feature 
or set of features would exert on the outcomes. Indeed, this 
might be said to be the ultimate objective of much of the re- 
search in school learning. Until this objective has been 
achieved, expedients such as the cloze readability procedure 
must play a vital role in the evaluation of instructional 
programs that utilize written, verbal instructional materials.

The second purpose of this paper is to examine the
possibility of developing a method, which incorporates the cloze
procedure, for making criterion reference tests over verbally 
presented instruction. It is not possible to evaluate the 
outcomes of an instructional program unless there is some way 
to determine what content was taught and whether each item of 
content was learned. Conventional test making procedures
offer no method for objectively deriving a list of the items 
of content taught. Even if such a list could be derived,
conventional test writing theory offers no objective procedure for
deriving test questions from the list of content items.
Because there is no rigorous way to determine whether the
knowledge measured by a test is representative of the knowledge
taught by the program, there is clearly no way to construct 
a criterion referenced test over verbally presented instruction. 

In the second section of this paper, test item writing 
theory will be cast into a more definite form than it has 
taken in the past. A procedure will then be proposed for mak- 
ing criterion reference tests from programs containing verbal- 
ly presented instruction. It will be seen in the course of 
this discussion that the use of the cloze procedure is es- 
sential for the selection of the items included in criterion 
referenced tests made from verbally presented instruction. 

Cloze Readability Procedure 

Cloze tests can be made in a variety of ways. When 
they are used to measure the comprehension difficulties of 
text materials, however, investigators almost invariably use 
a specific set of procedures called the cloze readability 
procedure. Cloze readability tests are made by deleting 
every fifth word from a passage. The deleted words are re- 
placed by underlined blank spaces of a uniform length, and 
the tests are mimeographed. 

Cloze readability tests are given to subjects who have 
not been permitted to read the passage. The subjects are
instructed to write in each blank the word they think was
deleted to form that blank. A response is scored "correct" when
it exactly matches the word deleted. The difficulty of a
passage is the mean of the subjects' scores on the test.
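
The construction and scoring rules just described can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the choice of the fifth word as the first deletion, and the decision to ignore case and attached punctuation when matching are assumptions made here, not part of the procedure as stated.

```python
import string

BLANK = "_" * 12  # every blank has the same length


def make_cloze(text, n=5, start=5):
    """Replace every n-th word with a blank, beginning at word `start` (1-based).

    Returns the mutilated text and a dict {word position: deleted word}.
    """
    shown, answers = [], {}
    for pos, word in enumerate(text.split(), start=1):
        if pos >= start and (pos - start) % n == 0:
            answers[pos] = word
            shown.append(BLANK)
        else:
            shown.append(word)
    return " ".join(shown), answers


def normalize(word):
    # Assumption: case and attached punctuation are ignored when comparing.
    return word.strip(string.punctuation).lower()


def cloze_score(responses, answers):
    """Proportion of blanks filled with exactly the deleted word."""
    hits = sum(
        normalize(responses.get(pos, "")) == normalize(word)
        for pos, word in answers.items()
    )
    return hits / len(answers)


def passage_difficulty(subject_scores):
    """Mean of the subjects' scores on the test."""
    return sum(subject_scores) / len(subject_scores)
```

A subject's responses are passed as a dict keyed by word position, and the passage difficulty is then the mean of the resulting scores across subjects.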



The difficulty of every word, phrase, clause, or sentence
in the passage can also be determined by using five forms
of a cloze test over the passage. To make the first form, words
1, 6, 11, etc. are deleted; words 2, 7, 12, etc. are deleted
to make the second form. This procedure continues until all
five forms have been made and every word in the passage
appears as a cloze item in exactly one test form. The
proportion of subjects writing the correct word in a blank is used
as a measure of the difficulty of the word deleted. The
difficulties of the words within a phrase, sentence, or passage
are averaged to determine the difficulties of those units.
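
The five-form scheme can be illustrated as follows. The data layout (one list of true/false marks per subject, aligned with the positions deleted in that form) and the function names are assumptions made for this sketch. Following the text, "difficulty" here is the proportion of subjects answering correctly.

```python
def form_positions(n_words, form, n_forms=5):
    """1-based word positions deleted in a given form (form = 1..n_forms)."""
    return list(range(form, n_words + 1, n_forms))


def word_difficulties(n_words, correct_by_form, n_forms=5):
    """Proportion of subjects restoring each word correctly.

    correct_by_form[form] is a list with one entry per subject; each entry
    is a list of booleans aligned with form_positions(n_words, form).
    """
    difficulty = {}
    for form in range(1, n_forms + 1):
        subjects = correct_by_form[form]
        for i, pos in enumerate(form_positions(n_words, form, n_forms)):
            difficulty[pos] = sum(marks[i] for marks in subjects) / len(subjects)
    return difficulty


def unit_difficulty(difficulty, first, last):
    """Average word difficulty over positions first..last (inclusive)."""
    values = [difficulty[pos] for pos in range(first, last + 1)]
    return sum(values) / len(values)
```

Because each word appears as an item in exactly one form, the difficulty of any phrase, sentence, or passage is obtained by averaging over the word positions it spans.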



Other Evaluation Methods 

Readability Formulas. Perhaps one of the chief reasons
why instructional materials are not routinely evaluated to
determine whether they have a suitable level of difficulty is
that there has been no technique that is at once convenient,
economical, and valid. Readability formulas are convenient,
inexpensive, and require only unskilled clerical assistance
to use, but the formulas currently available have validities
that range from .5 to only about .7. Further, these equations
take into account only a limited range of linguistic
variables and the variables that are taken into account are, by
today’s standards, crude. Recent research by Coleman (1966a)
and Bormuth (1966a) shows that readability formulas having high
validities can be developed, but the research that will obtain
these formulas is still in progress.

Direct Testing. Using conventional comprehension tests
to test materials directly on students seems more valid than
using readability formulas, but it is also expensive and
unreliable. Because the test items themselves represent a
reading task for the student, it is uncertain whether it is
the difficulty of the passage or the difficulty of the items
that is measured by this procedure.

Programming. Instructional programming might be said
to be a third method of determining the difficulty of
materials. As programming is currently done, it is an expensive
process. Further, programming techniques employ test items
similar to those used in conventional comprehension tests.
As a result, the criticisms leveled at the use of
conventional comprehension tests hold also for programming.



Validity of Cloze Readability Tests

If cloze readability tests are to be used as a measure
of the comprehension difficulty of written instructional
materials, evidence showing that the tests measure the reading
comprehension abilities of students is needed. Further, it
must be shown that the difficulties of cloze tests correspond
to the difficulties of other tests used to measure the
difficulty subjects have in understanding materials.

Criteria of Validity 

Two Concepts of Comprehension. It is necessary to
analyze the concept of comprehension further, since there is a
fundamental disagreement about which of two measurement
operations best represents the concept of comprehension ability.
Traditionally, the comprehension ability of a person is
measured by having him read a passage and then testing his
knowledge of the content of the passage. But scores derived in
this manner measure both the person’s knowledge acquired as
a result of reading the passage and the knowledge he
possessed before he read the passage. Comprehension measured in
this way will be referred to as post-reading knowledge. On
the other hand, many hold that comprehension ability is a
set of generalized skills enabling the individual to
acquire knowledge from materials. Reasoning from this point
of view leads to the claim that comprehension ability is
best represented by a score obtained by finding the
difference between scores on a test administered before and
after the passage is read. Comprehension measured in this way
will be referred to as knowledge gain.

Value Placed on Both Concepts. Both conceptualizations
of comprehension are relevant to the evaluation of
instructional materials. Of course it is highly desirable to select
materials from which students acquire much new knowledge.
Despite this, previously acquired knowledge is deliberately
included in materials in order to provide the repetition
essential for retention and in order to state the
relationships between knowledge previously acquired and the
knowledge being presented for the first time. Hence, a measure
used to assess the comprehension difficulty of materials
should, ideally, be capable of measuring comprehension in
either or both of these ways, since both represent
desirable characteristics of materials.



Validity Research 

Measurement of Post-Reading Knowledge. Nearly all the
validity research on cloze readability tests has concentrated
on demonstrating their validities as measures of
post-reading knowledge. It seems that only one study approached this
problem experimentally. Bormuth (1962) made a cloze test and a
multiple choice test over each of nine passages. The passages
were written so that they varied systematically in subject
matter and language complexity. Both sets of tests were
given to subjects in grades 4, 5, and 6. Each of the main
effects and the interaction between language complexity and
subject matter produced significant and roughly proportionate
effects on the cloze readability and multiple choice scores.

A large number of studies have reported correlations
between cloze readability test scores and scores on tests
of the type to which the label "comprehension" is
conventionally applied. The first studies discussed used
comprehension tests made from the same passages as the cloze
tests. Taylor (1956), using Air Force trainees as subjects,
found a correlation of .76; Jenkinson (1957), using high
school students, found a correlation of .82; Bormuth (1962),
using elementary school pupils, found correlations ranging
from .73 to .84; and Friedman (1964), who used college
students, gave comprehension tests consisting of 8 to 12 items
each and obtained correlations ranging from .24 to .43.

These correlations seem high in view of the fact that, 
where test reliabilities were reported, the validity cor- 
relations and the reliabilities were approximately of the 
same magnitude. 
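
The reasoning behind this judgment can be made concrete with the classical correction for attenuation, under which the correlation observable between two tests cannot, apart from sampling fluctuation, exceed the square root of the product of their reliabilities. The sketch below is offered only as an illustration of that standard relationship; the source does not report such a calculation, and the function names are assumptions.

```python
import math


def max_observable_correlation(rel_x, rel_y):
    """Upper bound on an observed correlation, given the two reliabilities."""
    return math.sqrt(rel_x * rel_y)


def disattenuated(r_xy, rel_x, rel_y):
    """Observed correlation corrected for unreliability in both tests."""
    return r_xy / math.sqrt(rel_x * rel_y)
```

When a validity correlation is about as large as the reliabilities themselves, the corrected value approaches 1.0, which is the sense in which such correlations may be called high.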

A fairly large number of studies have reported cor- 
relations between cloze readability tests and standardized 
tests of reading achievement. Table 1 shows the studies 
and the correlations reported. It is difficult to inter- 
pret these correlations because the authors frequently 
failed to report the variances and reliabilities of the 
tests for the subjects used in their studies. This was a 
problem especially in the studies using college students. 
College students could be expected to exhibit a curtailed
distribution of individual differences which would reduce the
sizes of the correlations. But, when this fact is taken into
account, the correlations shown in Table 1 seem reasonably
high.

Table 1

Correlations Between Cloze Readability Tests and
Standardized Tests of Reading Achievement

Study                   Subjects      Tests                          Correlations

Jenkinson (1957)        High School   Cooperative Reading C2
                                        Vocabulary                   .78
                                        Level of Comprehension       .73

Rankin (1957)           College       Diagnostic Survey
                                        Story Comprehension          .29
                                        Vocabulary                   .68
                                        Paragraph                    .60

Fletcher (1959)         College       Cooperative Reading C2
                                        Vocabulary                   .63
                                        Level of Comprehension       .55
                                        Speed of Comprehension       .57
                                      Dvorak-Van Wagenen
                                        Rate of Comprehension        .59

Hafner (1963)           College       Michigan Vocabulary Profile    .56

Ruddell (1963)          Elementary    Stanford Achievement
(6 cloze tests)                         Paragraph Meaning            .61-.74

Weaver and Kingston     College       Davis Reading                  .25-.51
(1963, 2 cloze tests)

Green (1964)            College       Diagnostic Reading Survey
                                        Total Comprehension          .51

Friedman (1964)         College       Metropolitan Achievement
(20 cloze tests)        (Foreign        Vocabulary                   .63-.85
                        Students)       Total Reading                .71-.87




Two studies investigated the factor validities of cloze
tests. Weaver and Kingston (1963) performed a principal
component analysis on the correlations among various tests. The
tests included some classifiable as cloze readability tests;
they also included a standardized test of reading
comprehension. The cloze tests exhibited low correlations with the
principal component with which the comprehension test had its
highest correlation. Bormuth (1966b) pointed out that this
study contradicted the findings of much of the earlier
research on cloze tests. In brief, he showed that the
correlations involving other tests in the battery exhibited
correlation patterns that were highly unusual for them, and that
the population of subjects exhibited a curtailed range of
variability. He then presented an analysis of data from an earlier
study (1962) which showed that a single component accounted
for nearly all the variance in a set of cloze tests and
multiple choice comprehension tests.

Measurement of Knowledge Gain. There is still only a
small amount of information bearing on the question of whether
cloze tests are useful as measures of knowledge gain, and this
scant information is indirect. Taylor (1956) and Rankin (1957)
each found that subjects who read the intact passages before
taking the cloze tests made from those passages achieved
higher scores than subjects who had not read the passages. On
the other hand, Green (1964) found that having subjects read
the passages before taking the cloze tests did not increase
their cloze scores over the scores they achieved on cloze
tests given them before they read the passages. Rankin (1965)
challenged Green’s results, pointing out that Green failed to
correct for the regression effects present in studies using
this design.

Measurement of Passage Difficulty. A reasonably
substantial amount of research has accumulated showing that
cloze readability test difficulties correspond closely to the 
difficulties of passages as measured by other methods. Taylor 
(1953), the originator of the cloze procedure, found that 
cloze readability test difficulties ranked the passages in 
the same order in which the readability formulas ranked them. 
When he selected three additional passages which, when judged 
subjectively, ranked one way but, when analyzed by readability 
formulas, ranked in the reverse order, the cloze readability 
test difficulty rankings agreed with the subjective judgments. 
Sukeyori (1957) found a correlation of .83 between the com- 
bined subjective rankings given eight passages by three judges 
and cloze readability test difficulties of the passages. 
Bormuth (1962) found a correlation of .92 between the cloze
readabilities of 9 passages and the difficulties of multiple
choice comprehension tests made from the same passages. In
a more recent study Bormuth (1966) used four sets of 13
passages each and found correlations ranging from .91 to .96
between the cloze readabilities and the comprehension diffi- 
culties of the passages. The correlations between the mean 
number of words pronounced correctly by subjects who read 
the passages orally and the cloze readabilities of the pas- 
sages ranged from .90 to .95. 

Cloze Test Reliability. When cloze readability tests
are used only as measures of the relative abilities of
subjects, they are probably somewhat less reliable than
well-made multiple choice tests containing the same number of
items. For example, Bormuth (1962) found that the nine
31-item multiple choice tests used in his study exhibited
reliabilities about equal to those of the nine 50-item
cloze readability tests made from the same
passages. It seems likely that this may have resulted from
the fact (Fletcher 1959 and Bormuth 1962) that cloze
readability tests nearly always contain a number of very difficult
and very easy items which are less efficient discriminators
(Davis 1949) than items in the intermediate range of
difficulty. However, the large number of very difficult and
very easy items appearing in cloze readability tests is
actually an asset, making the tests useful in testing subjects
differing widely in ability. Zero scores, maximum scores,
and skewed distributions are rarely observed when cloze
readability tests are carefully administered. But this range
apparently has its limits. Gallant (1964) found that cloze
... with first grade children.



Application of the Cloze Readability Procedure 

A substantial body of research has dealt with the tech- 
nical questions arising when cloze readability procedure is 
used to evaluate the difficulty of instructional materials. 
The results of this research seem to justify the application 
of the procedure to a range of evaluation tasks. The fol- 
lowing discussion considers the major problems encountered 
at each step and discusses the research dealing with those
problems.



Designing the Testing Procedure 

Cloze readability procedure may be adapted either to 
measuring the difficulties of short or long passages or to 
measuring the difficulty of a given piece of material for an 
individual or for a whole group. Because the number of
possible testing designs is almost infinite, only three designs
will be discussed to illustrate the principles and problems 
of designing materials evaluation studies. 

Multiple Sampling Problems. When the cloze readability
procedure is used to determine the difficulty of a text, the
investigator often deals simultaneously with three samples.
First, because it is often impractical to test materials on 
the whole population with whom the materials are to be used, 
the investigator draws a sample of pupils to represent this 
population. The accuracy of his results depends, in part, 
on the extent to which the sample is representative of the 
population. 

Second, the items in a cloze test represent only a sam- 
ple of the items that can be made over that passage. When 
long texts are evaluated, it may be an inefficient use of 
resources to make all five of the cloze test forms over the 
passages studied. Therefore, the investigator must sometimes 
deal with what is called item sampling error. The Kuder-
Richardson (1937) formula 21 for calculating test reliability
takes item sampling error into account (Lord 1955). The
error of the mean that is due to item sampling error may be 
usefully estimated by Lord’s (1955) formula 21. A less com- 
plicated procedure is to use two or more cloze test forms 
over the same passage, and then calculate the variance of 
the form means. Subtracting the population sampling error 
variance from the variance of the form means gives an esti- 
mate of the item sampling error. 
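
As a rough illustration of the two calculations mentioned above, the sketch below computes Kuder-Richardson formula 21 from total scores and estimates item sampling error from the means of several forms. Several details are assumptions made for the sketch and are not specified in the text: the population form of the score variance in formula 21, the estimate of the population (pupil) sampling error variance as the average of each form's score variance divided by its number of pupils, and the flooring of a negative difference at zero.

```python
import statistics


def kr21(total_scores, n_items):
    """Kuder-Richardson formula 21 reliability from total test scores."""
    mean = statistics.mean(total_scores)
    var = statistics.pvariance(total_scores)  # assumption: population variance
    return (n_items / (n_items - 1)) * (1 - mean * (n_items - mean) / (n_items * var))


def item_sampling_variance(form_scores):
    """Variance of form means minus the pupil sampling error variance.

    form_scores holds one list of pupil scores per cloze form, all forms
    being made over the same passage.
    """
    form_means = [statistics.mean(scores) for scores in form_scores]
    between_forms = statistics.variance(form_means)
    # Assumption: error variance of a form mean is s^2 / n, averaged over forms.
    pupil_error = statistics.mean(
        [statistics.variance(scores) / len(scores) for scores in form_scores]
    )
    return max(between_forms - pupil_error, 0.0)
```

At least two forms, each taken by at least two pupils, are needed for the variances to be defined.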

Third, when a lengthy text is evaluated, it is generally 
not practical to make a cloze test over its full extent. In
consequence, sample passages must be drawn from the text and
the cloze tests made over just the sample passages. Hence,
the investigator must consider passage sampling error.
Passage sampling error can be estimated by finding the difficulty
of each of the passages in the sample, calculating the
variance of the passage difficulties, and then subtracting the
population and item sampling error variances.
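
The subtraction described for passage sampling error might be sketched as follows. The function name and the convention of reporting zero when the remainder is negative are assumptions; the two error variances are taken as already estimated by whatever method the investigator has chosen.

```python
import statistics


def passage_sampling_variance(passage_difficulties, pupil_error_var, item_error_var):
    """Variance among passage difficulties after removing the other two errors."""
    observed = statistics.variance(passage_difficulties)
    # Assumption: a negative remainder is reported as zero.
    return max(observed - pupil_error_var - item_error_var, 0.0)
```

The result is on the scale of squared difficulty units, so its square root is the quantity comparable to a standard error.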

Designs. An elaborate design for a text evaluation
study might follow these steps. First, the sections of the
text are numbered consecutively and passages drawn randomly
from each chapter. Two or more passages are drawn from each
chapter so that the relative difficulties of different
chapters can be compared. Second, two or more forms of a cloze
test are made from each passage. The tests should be nearly
identical in the number of items they contain. Third, the
sample of pupils is drawn randomly, or as nearly so as
possible, from the population with whom the cloze tests are to
be used, and each pupil is randomly assigned to take one of
the cloze tests. When two or more texts are being evaluated,
this design permits the investigator to use analysis of
variance to ascertain whether the materials differ significantly
and to determine how variable each text is from chapter to
chapter.
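
The random drawing of passages and the random assignment of pupils in this design could be carried out along the following lines. The function names, the use of a fixed seed for reproducibility, and the balancing of assignments so that each test is taken by nearly the same number of pupils are assumptions of the sketch rather than requirements stated in the design.

```python
import random


def draw_passages(sections_by_chapter, per_chapter=2, seed=0):
    """Draw `per_chapter` numbered sections at random from every chapter."""
    rng = random.Random(seed)
    return {
        chapter: sorted(rng.sample(sections, per_chapter))
        for chapter, sections in sections_by_chapter.items()
    }


def assign_tests(pupils, tests, seed=0):
    """Randomly assign each pupil to one cloze test, keeping groups balanced."""
    rng = random.Random(seed)
    order = list(pupils)
    rng.shuffle(order)
    # Dealing the shuffled pupils out in turn keeps group sizes nearly equal.
    return {pupil: tests[i % len(tests)] for i, pupil in enumerate(order)}
```

Here `tests` would list every form of every sampled passage, so that each pupil takes exactly one of them.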

A less expensive procedure consists of using shorter
passages, passages of about 50 words. Two forms of a cloze
test are made from each passage and the passages are formed
into a single test having two forms. The tests are then
given to pupils drawn randomly from the population. This
procedure ... texts. It does not, however, permit the comparison of
chapters within a text. It is also less reliable because shorter
passages are used.

The simplest problems are presented by the evaluation
of short passages such as test items, picture captions, and
other passages of less than 1,000 words. All five forms of
a cloze test are made from the passage and each form is
given to a different, randomly-selected sample of pupils. Where
the passage is very short (containing fewer than ... words),
it is doubtful that individual scores are sufficiently
reliable to permit an accurate judgment of how well a given
individual understood the passage. The results do provide an
accurate estimate of how well the group as a whole understood
the passage.

Problems. The first problem encountered is deciding
how many pupils, cloze test items, and sample passages should
be used. Increasing the number of each reduces the error in
estimating the difficulty of the materials, but by different
amounts. Bormuth (1965a) found that increasing the number
of items in a cloze test reduces error more rapidly than
adding the same number of students. There are, at present, no
data on the relative size of the error resulting from
passage sampling. The second problem stems from the
conjecture that the difficulty of a sample passage from a text
may depend, in some degree, on whether the pupil has
studied the text preceding the passage. While this may
present little problem in most content areas, it is
conceivable that in areas such as the sciences, the effect could
be considerable. This would seem to indicate that some
evaluation studies should be designed to accompany
instruction in such a way that the pupil is tested on a
passage just before he is to study the section containing that
passage.

Deletion Procedure

While nearly all readability research employs tests 
made by deleting every fifth word, cloze tests can be made 
by deleting every nth word, words at random, or just the 
words of a given type. The only restriction is that the 
words deleted must be selected entirely by an objectively
specifiable process; otherwise the test must be classified
as a common completion test (Taylor 1953). 

Cloze test users encountered the problem of discov- 
ering how many words of text had to be left between cloze 
items. Leaving fewer words between items makes it pos- 
sible to obtain a larger number of items from a given 
length of text and reduces the number of test forms that
have to be made in order to eliminate item sampling error.
Leaving too few words between items, by contrast,
introduces the possibility that items will exhibit statistical
dependence of the sort where the probability of a subject
responding correctly to an item is dependent upon whether
or not he is able to answer adjacent items. When
appreciable statistical dependence exists, test scores cannot be
treated by conventional statistical methods. MacGinitie
(1961) studied the problem by varying the number of words
of text left intact on either side of a set of cloze items.
He was unable to detect any dependence among items when
four or more words of text were left between items.
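
Two of the objectively specifiable deletion schemes, together with a check on the number of intact words left between items, are sketched below. The random scheme shown is only one of many possible ones; its greedy selection rule, the seed parameter, and the function names are assumptions introduced for illustration.

```python
import random


def every_nth(n_words, n=5, start=1):
    """Positions deleted when every n-th word is removed, from word `start`."""
    return list(range(start, n_words + 1, n))


def random_deletions(n_words, k, min_intact=4, seed=0):
    """Pick up to k positions at random with at least `min_intact` words between."""
    rng = random.Random(seed)  # a fixed seed keeps the process reproducible
    candidates = list(range(1, n_words + 1))
    rng.shuffle(candidates)
    chosen = []
    for pos in candidates:
        if all(abs(pos - other) > min_intact for other in chosen):
            chosen.append(pos)
            if len(chosen) == k:
                break
    return sorted(chosen)


def min_intact_words(positions):
    """Smallest number of intact words between adjacent items (needs 2+ items)."""
    ordered = sorted(positions)
    return min(b - a - 1 for a, b in zip(ordered, ordered[1:]))
```

Deleting every fifth word leaves exactly four intact words between items, the smallest spacing at which no dependence was detected.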

Taylor (1953) pointed out that methods involving the 
deletion of only words belonging to certain categories had
to be excluded from use in readability studies because the
frequency with which such words occur in a passage may it- 
self be a variable influencing the difficulty of the pas- 
sage. There seems to have been no research dealing with 
some of the more technical problems in the deletion process 
such as the problem of what should be deleted when a numeral 
is encountered. For example, should 128 be treated as if 
it contained three words or should it be deleted as a unit? 
It is not even clear whether a criterion can be found for
deciding issues of this sort.

Test Administration

The two principal alternatives in administering a cloze
test are to give it either to subjects who have not read the
passage or to subjects who have first been exposed to the
passage. Giving the cloze test to subjects who have not read
the passage obviously uses time more economically. Moreover,
it might be argued that giving a cloze test to subjects after
they have read the passage causes scores to be influenced by
the subject’s rote memorization of the passage. (Rote memory
is a process commonly held to be different from comprehension.)

The results of validity studies indicate that it makes 
little difference which method is used. For example, Taylor 
(1956) found that scores on cloze tests administered after 
subjects had read the passages exhibited both slightly great- 
er variances and slightly higher correlations with compre- 
hension tests than cloze tests administered to subjects who 
had not read the passage. Rankin’s (1957) studies showed 
the same results. The greater variance alone seems suffi- 
cient to account for the increased correlation. Consequently, 
when greater validity or reliability is desired, it is prob- 
ably more economical to obtain it by increasing the number 
of items in the cloze test and by giving the tests to sub- 
jects who have not read the passage. 

Scoring Procedure 

A response can differ from the deleted word in semantic
meaning, grammatical inflection, and spelling. Users of cloze
readability tests nearly always score "correct" just those
responses where the stem of the response, the uninflected form 
of the word, exactly matches the word deleted. The research 
seems to support this practice. Taylor (1953) found that 
scores obtained by counting synonyms in addition to responses 
exactly matching deleted words were no better than scores ob- 
tained by counting only responses exactly matching the words 
deleted when the scores were used to discriminate among pas- 
sage difficulties. Rankin (1957) and Ruddell (1963) found 
that scores obtained by counting words exactly matching and 
synonyms of the deleted words resulted in the scores having 
slightly, but not significantly, greater variances and cor- 
relations with scores on comprehension tests. 

In the past, some investigators scored responses
"correct" when they were inflected differently from the deleted
word. Bormuth (1965b) studied the correlations between
comprehension test scores and several categories of cloze test
scores which were obtained by counting responses classified
according to whether their inflections were correct in the
context of the blank and further classified according to
whether the stem of the response exactly matched, was
synonymous with, or was semantically unrelated to the deleted words.
All scores obtained by counting grammatically correct
responses exhibited positive correlations. The correlation
involving a count of exactly matching responses was .84;
the one involving a count of synonyms was .64; and the one
involving semantically unrelated responses was .56. All
... be indistinguishable from zero. Further, a multiple
regression analysis showed that scores based on a count of the
responses which exactly matched the deleted words in both 
inflection and word stem accounted for 95 per cent of the 
comprehension test variance that could be predicted from 
the total set of cloze test scores. Thus, it would seem 
that the most economical and objective method of scoring 
cloze tests, the exact word method, yields the most valid
results.
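
The difference between the exact word method and synonym scoring can be shown with a small sketch. The synonym table is a hypothetical, hand-made lookup supplied by the scorer, which is itself a reminder of why synonym scoring is the less objective of the two methods.

```python
def count_correct(responses, deleted, synonyms=None):
    """Count responses matching the deleted words.

    With `synonyms` omitted this is the exact word method; otherwise a
    response listed as a synonym of the deleted word is also counted.
    """
    synonyms = synonyms or {}
    total = 0
    for given, target in zip(responses, deleted):
        if given == target or given in synonyms.get(target, ()):
            total += 1
    return total
```

Scoring the same responses both ways shows how many additional credits synonym scoring grants over the exact word count.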

Most investigators score misspellings correct when the 
response is otherwise correct and when the misspelling 
does not result in the correct spelling of another word that 
also fits the syntactic context of the blank. No research 
seems to have tested the validity of this practice. Simi- 
larly, the influence of illegibly written responses has not 
received study. 



Interpretation of Scores 

The difficulty of a text should be reported in terms
that make clear how appropriate the text is for a given
individual or group. This may be accomplished either by stating
the proportion of the group which is able to achieve cloze
readability scores at or above some criterion level of per- 
formance or by stating the level of achievement possessed 
by pupils who are able to attain the criterion level of per- 
formance. To do either requires that a criterion score on 
cloze readability tests be established as representing an 
acceptable level of understanding a passage. 

Criterion Score. Establishing a criterion of acceptable
performance on a cloze readability test presents two major
problems. First, since cloze readability tests have been
in use for only a short time and since they differ radically
in difficulty from conventional tests, users have not yet
developed a "feel" for what is acceptable performance on a
cloze test. Second, the establishment of a criterion score
has traditionally been viewed as a matter to be left to
personal preference or arbitrary choice rather than as a
matter for rational decision based, at least in part, on
empirical data.

The most direct approach to establishing a criterion
score for cloze readability tests is to adopt a criterion
score traditionally used and then to determine what cloze
score is comparable to this criterion score. Bormuth (1966c
and 1966d) adopted the 75 per cent criterion score which
has a long tradition of acceptance (Thorndike 1917) and
widespread use in current practice (Betts 1946 and Harris 1962).
According to this criterion, a passage is said to be suitable
for use in a pupil's instruction if he responds correctly
to 75 per cent or more of the questions asked him
about the passage. In one study, Bormuth used multiple
choice tests and had the pupils read the passages silently.
In the other study, using different materials and subjects,
he used short answer completion tests and had the pupils
read the passages and respond to the questions orally. In
both studies a cloze score of about 44 per cent was found
to be comparable to the 75 per cent criterion. Since the
exact word method of scoring was used in both studies, this
cloze criterion score is useful only for interpreting other
cloze readability tests scored according to that method.
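
As a rough illustration of how such a criterion might be applied,
the sketch below computes the proportion of a group whose exact
word cloze scores reach the 44 per cent level. The function name
and the scores are hypothetical, and the use of "at or above" as
the comparison is an assumption.

```python
CLOZE_CRITERION = 0.44  # cloze score reported as comparable to 75 per cent


def proportion_reaching_criterion(cloze_scores, criterion=CLOZE_CRITERION):
    """Return the share of subjects scoring at or above the criterion."""
    if not cloze_scores:
        raise ValueError("at least one score is required")
    reached = sum(1 for score in cloze_scores if score >= criterion)
    return reached / len(cloze_scores)


# Hypothetical exact word cloze scores, as proportions of blanks correct.
scores = [0.30, 0.44, 0.52, 0.41, 0.60]
share = proportion_reaching_criterion(scores)
```

Three of the five hypothetical subjects reach the criterion.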

A more adequate approach to the establishment of a
criterion score was demonstrated by Coleman (1966b), who set
out to determine what level of passage difficulty resulted in
the greatest amount of information gain on the part of
students reading the passages. He measured information gain
by typing the passage on a transparency and covering the
words with strips of tape. When this was projected, the
student was asked to guess and write down the first word.
That word was then exposed and the student was asked to
guess the next. Following the first run through the passage,
the tape was replaced and the procedure repeated. The difference
between a student's scores on the two trials was taken
as a measure of information gain. Passage difficulty was
determined on a matched group of subjects using cloze
readability tests. Interestingly enough, his results seemed to
show that maximum information gain occurred on passages having
difficulties of close to 44 per cent, the cloze score
found to be comparable to the traditional 75 per cent criterion.
A question has been raised (MacGinitie 1966) about
whether the "information gained" by the subjects in Coleman's
study was influenced unduly by rote memorization. Whatever
the merits of that conjecture, it seems clear that Coleman's
study demonstrated how a rational approach can be made to
the establishment of criterion scores.
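
The logic of this approach can be sketched as follows. The data
are invented for illustration and are not Coleman's; the function
names and the use of the mean gain per passage are assumptions.

```python
from statistics import mean


def information_gain(first_trial, second_trial):
    """Gain is the difference between the scores on the two trials."""
    return second_trial - first_trial


def difficulty_with_max_gain(passages):
    """Return the passage difficulty with the largest mean gain.

    `passages` maps a cloze difficulty (proportion correct in a matched
    group) to a list of (first trial, second trial) scores.
    """
    mean_gain = {
        difficulty: mean(information_gain(a, b) for a, b in trials)
        for difficulty, trials in passages.items()
    }
    return max(mean_gain, key=mean_gain.get), mean_gain


# Invented scores: number of words guessed correctly on each trial.
passages = {
    0.20: [(5, 9), (6, 10)],
    0.44: [(12, 22), (14, 24)],
    0.70: [(30, 33), (28, 32)],
}
best_difficulty, gains = difficulty_with_max_gain(passages)
```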

Reporting Passage Difficulty. The simplest method of
reporting difficulty scores is to report the mean difficulty
of the text and the proportion of subjects whose scores
exceeded the criterion score. However, this method limits the
general usefulness of the results. It is often impossible
to draw the subjects in such a way that they are a
representative sample of the pupils with whom the materials are to
be used. There is no way to be sure, therefore, that the
proportion of subjects who reached the criterion score in
the sample will represent the proportion in the population.
And, even if the sample of subjects were representative of
the population in a school system, it is virtually certain
that the sample would not be representative of the subjects
in the total population of pupils with whom the materials
are to be used. Since text readability studies are of general
interest and since they are somewhat costly to conduct,
it seems advisable to use a somewhat more generally useful
method of reporting the difficulty of a text.

A fairly easy method to use results in giving a grade 
placement number to the text. First, the subjects' scores
on the cloze readability tests are correlated with their 
scores on a test of reading achievement; then, using the 
regression prediction formula, the achievement grade place- 
ment score that corresponds to the cloze readability crite- 
rion score is calculated. The grade placement score can 
then be interpreted as the average achievement of subjects 
who were able to attain the criterion level of performance 
on the cloze tests made from the text. Other schools using 
the same achievement test can estimate the appropriateness 
of the text for their pupils by determining what proportion 
of the pupils have achievement scores that exceed the re- 
ported passage grade placement. Further, since there are 
many published studies of the comparability of achievement 
test norms, the results should be useful almost regardless 
of what achievement test a school uses. 
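
A minimal sketch of this calculation is given below, assuming a
simple least squares regression of achievement grade placement on
cloze score. The function name and the data are hypothetical; the
44 per cent default is the criterion discussed above.

```python
def passage_grade_placement(cloze_scores, achievement_grades, criterion=0.44):
    """Predict the achievement grade placement at the cloze criterion.

    Fits achievement = intercept + slope * cloze by least squares and
    evaluates the fitted line at the criterion score.
    """
    n = len(cloze_scores)
    if n < 2 or n != len(achievement_grades):
        raise ValueError("need paired scores for at least two subjects")
    mean_x = sum(cloze_scores) / n
    mean_y = sum(achievement_grades) / n
    sxx = sum((x - mean_x) ** 2 for x in cloze_scores)
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(cloze_scores, achievement_grades)
    )
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * criterion


# Hypothetical subjects: cloze proportion correct and reading grade placement.
cloze = [0.2, 0.3, 0.4, 0.5, 0.6]
grades = [3.0, 4.0, 5.0, 6.0, 7.0]
placement = passage_grade_placement(cloze, grades)
```

With these invented data the text would be assigned a grade
placement of about 5.4; a school could then count the pupils whose
achievement scores exceed that figure.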

Conclusions

The use of the cloze readability procedure seems to result
in valid measurements of the comprehension difficulty
of written instructional material. The correlations between
cloze readability and conventional comprehension test scores
are high, and none of the research has presented convincing
evidence that the processes employed in responding to cloze
readability tests are, in any major sense, distinguishable
from those employed in responding to conventional comprehension
tests. Moreover, passage difficulties determined
using cloze readability tests correspond closely to the
passage difficulties obtained using other measures.

The cloze readability procedure has a number of ad- 
vantages not shared by other available methods of deter- 
mining difficulty. Unlike the conventional test items used 
in other methods where materials are tried out directly on 
students, cloze test items are easily made and do not inject 
irrelevant sources of variance into the measurement of dif-
ficulty. Further, the cloze readability procedure yields far
more valid results than the readability formulas presently 
available. However, when the readability formulas, now in 
developmental stages, become available for general use, they 
will probably be almost as valid and much less costly to use 
than the cloze readability procedure. 

Research on the technology of the cloze readability
procedure seems sufficient to permit the application of
this procedure to a wide range of materials evaluation tasks,
but three important problems remain to be solved. First,
it is not at all certain if cloze readability tests can be
used to measure knowledge gain. Second, a criterion level
of performance has yet to be established on a rational basis.
Third, it still must be determined if the act of isolating a 
passage from its context affects the difficulty of the pas- 
sage. A few other problems are also unsolved. For in- 
stance, there is the question of how to handle numerals in 
the word deletion rules. None of the problems seriously 
impairs the usefulness of the cloze readability procedure 
in improving the quality of materials evaluation studies. 



Cloze Tests in the Evaluation of the Outcomes of Instruction

Little attention has been given to exploring the potential
uses of cloze tests as measures of the knowledge students
gain as a result of instruction. The reason may be
that educators demand that achievement test questions seem
valid, at least intuitively, as measures of the knowledge
imparted by instruction. While cloze tests may be made from
the instructional materials themselves, it has remained
obscure just how a given cloze test item might test the
knowledge gained in instruction.

This section will advance the claim that there is a 
formal similarity between some types of cloze test items and 
the conventional completion and multiple choice test items 
generally accepted as tests of the achievement of knowledge. 
It should be emphasized that the remainder of this discussion
is no longer confined to a consideration of just the
cloze readability procedure but is extended to the consideration
of cloze tests per se, that is, to tests made by
deleting any objectively definable language unit according
to a set of pre-specified rules.

The argument supporting the claim that there are formal
similarities between cloze and many of the conventional
achievement test items is based on reasoning that takes this
form. Where instruction is given in natural language, it
can be analyzed into a list of sentences. Most of the questions
that can be asked about a sentence can be expressed as
transformations performed on the syntax of the sentence,
coupled with the substitution of semantic equivalents for the
words and phrases in the sentences. The transformation
performed on the syntax of a sentence has the effect of deleting
the portion of the sentence which becomes the correct
response to the question. The substitution of semantic
equivalents can also be expressed as transformations on two
or more of the sentences in lists, where the instruction has
been systematic. Cloze tests can be produced by these same
manipulations.

Conventional Test Items 

Instruction as a List of Sentences: Verbal instruction
can be usefully regarded as a list of sentences whose truth
values have been verified. If one were to construct such a
list, it could consist either of the sentences in the exact
order in which they occurred in instruction or of an unordered
list that contains a somewhat larger number of sentences.
Consider this instructional passage:

Some arctic explorers were forced to eat raw polar 
bear meat. Many died before returning home. Polar 
bears are often infected with trichinosis. 

The list of sentences made from this passage may consist of
just these sentences in their present order. But, because
sentence order in connected discourse transmits information,
an unordered list must contain sentences stating the
information signaled by the sequence in which sentences occur.
There are no well defined procedures for analyzing the
syntax of discourse. However, an unordered list made from the
instructional passage above might contain sentences 1 through
6 listed below.

1. Some arctic explorers were forced to eat raw
polar bear meat.

2. Many arctic explorers died before returning home.

3. Eating raw polar bear meat has caused the death of
some arctic explorers.

4. Polar bears are often infected with trichinosis.

5. The deaths of some arctic explorers have been caused
by trichinosis.

6. Some arctic explorers contracted trichinosis as a
result of eating raw polar bear meat.

Presumably, this list of sentences contains
all the information contained in each of the sentences taken
separately (sentences 1, 2, and 4)
plus all the information contained in the ordering of the
sentences relative to each other (sentences 3, 5, and 6).
The act of making an unordered list of sentences may be what
instructional programmers loosely refer to as "making explicit"
the content of instruction.

Derivation of Sentences: If the instruction in an area
of discourse is systematic and complete, the list of sentences
contains or permits the derivation of all (and only)
the true sentences that can be stated about that area of
discourse. These derived sentences are regarded here as a part
of the list, but not as a part of the sentences actually
used in instruction.

If the instruction from which the passage above was
drawn were systematic and complete, it would have been
preceded by sentences defining the concepts and the
relationships among the concepts used in the passages. The following
are examples of sentences like some of those that might be
found in the preceding instruction:

7. An explorer is a person who is among the first
to examine a region.

8. Polar bears are white bears living in the arctic
regions.

9. The arctic is the region lying near the North
Pole of the Earth.

10. An uncooked substance is raw.

Ultimately, the instruction would involve contact with
concrete objects and sentences naming those objects.

In this discussion, the derivation of true statements
about an area of discourse refers to three kinds of behavior.
First, derivation refers to the act of transforming a complex
sentence into kernels. For example, sentence 1 can be
transformed into the sentences Some explorers were forced to eat
meat, The explorers explore the arctic, The meat was raw, The
meat was from bears, and The bears lived in the polar region.
These kernel sentences can be derived by mechanical processes.
Second, derivation refers to the act of deriving sentences
such as sentences numbered 3, 5, and 6 which are implied but
not explicitly stated in the instructional passage. While
most linguists think it likely that sentences of this type
may be derivable by a series of relatively mechanical
transformations of the sentences used in instruction, they do not
presently have sufficient knowledge of inter-sentence syntax
to permit us to specify the nature of the transformations for
deriving them. Third, derivation refers to the act of
substituting for one word or phrase another word or phrase that was
equated with it by one of the sentences in the instruction.
For example, the sentence Some people who were among the first
to examine the region around the North Pole of the Earth were
forced to eat uncooked meat from the white bears living in the
region around the North Pole of the Earth was obtained largely
by substituting equivalent phrases contained in sentences 7,
8, 9, and 10 for the words in sentence 1. Again, note that
this is a relatively mechanical process.
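
The mechanical character of the third kind of derivation can be
illustrated with a short sketch. The table of equivalents below is
a hypothetical one, loosely based on sentences 8 and 10, and the
function name is an assumption. Each phrase is replaced in a single
pass so that text already substituted is not substituted again.

```python
import re


def substitute_equivalents(sentence, equivalents):
    """Replace each listed word or phrase with its defined equivalent."""
    # Longer phrases are tried first so that they win over shorter ones.
    phrases = sorted(equivalents, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b"
    )
    return pattern.sub(lambda match: equivalents[match.group(0)], sentence)


# Hypothetical equivalences in the spirit of sentences 8 and 10.
equivalents = {
    "raw": "uncooked",
    "polar bear meat": "meat from the white bears living in the arctic regions",
}
derived = substitute_equivalents(
    "Some arctic explorers were forced to eat raw polar bear meat.",
    equivalents,
)
```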

Test of Knowledge: The ultimate test of whether a body
of knowledge has been mastered is whether the student behaves
appropriately in the environment referred to by the sentences
in the instruction. However, practical considerations force
educators to settle for less than conclusive proof of mastery,
for it is inconvenient to bring elephants into the classroom
or to recreate historic disasters, wars, and decisions for
the purpose of testing a student's knowledge of things of
this sort. Instead, educators rely on some form of verbal
response by the student. The student may be asked to write
essay exams or to answer objective questions about the
instruction.

Having students write essay examinations might be con- 
ceptualized in this context as testing a student's ability to 
select and repeat the sentences actually used in instruction 
and/or the sentences he derived from the instructional passage. 
Of greater interest here is the fact that answering objective 
test items can be conceptualized as filling the blanks left 
in the sentences. 

Question Transformations: Many (and perhaps all) of
the verbal questions used in objective tests can be
represented as transformations performed on the sentences in an
unordered list or on the sentences derived from such a list.
An important consequence of this assertion is the fact that
a fairly simple set of rules is sufficient to specify the
procedures for writing these items, and the procedures make the
item writing process completely objective and reproducible.
Of broader significance is the fact that this set of rules
completely specifies the total population of the test items
that can be written over a given unordered list of sentences,
making it operationally meaningful to speak of sampling a
population of test items over an area of discourse.

There are three general classes of transformations that
can be employed for turning a sentence into a question. The
first is the yes-no transformation which results in questions
answerable by the simple response of yes or no as in the
question Were some arctic explorers forced to eat raw polar bear
meat? Since questions of this sort are seldom used in testing,
they will not be further considered, but much of what will be
said subsequently also applies to the yes-no question. The
second is the completion question made by deleting a word,
phrase, or clause from a sentence as in Some ______
were forced to eat raw polar bear meat. The third is the
wh- question in which a wh- question marker (when, where, how,
why, what, how many, etc.) is inserted in the place of a word,
phrase, or clause and the word order of the sentence is
sometimes rearranged. The question Who were forced to eat raw
polar bear meat is an example.

The completion question is perhaps the easiest of all
questions to generate. A sentence is selected from the list,
a word or phrase is selected from the sentence, and the word
or phrase is replaced by a blank space. Table 2 shows the
completion questions that can be formed from the sentence
shown in Figure 1. An important feature of the completion
question is the fact that, where more than one word is
deleted, the deleted words invariably constitute a phrase.
This may be verified by tracing the derivations of the
deleted words up through the phrase structure tree in Figure
1. The deleted units invariably constitute all the words
dominated by a single phrase node. Deletions which cut
across phrase boundaries, such as The little ______
horse, virtually never occur in tests. Evidently, the
structure of the language requires that all deletions constitute
a phrase.

Table 2

The completion questions that can be made from
the sentence in Figure 1

Node   Question*

S      Some Arctic explorers (were forced to eat raw polar bear meat).
S      (Some Arctic explorers) were forced to eat raw polar bear meat.
NP1    (Some) Arctic explorers were forced to eat raw polar bear meat.
NP1    Some (Arctic explorers) were forced to eat raw polar bear meat.
MN1    Some (Arctic) explorers were forced to eat raw polar bear meat.
MN1    Some Arctic (explorers) were forced to eat raw polar bear meat.
VP1    Some Arctic explorers (were forced) to eat raw polar bear meat.
VP1    Some Arctic explorers were (forced) to eat raw polar bear meat.
       Some Arctic explorers were forced (to eat raw polar bear meat).
NP2    Some Arctic explorers were forced (to eat) raw polar bear meat.
NP2    Some Arctic explorers were forced to (eat) raw polar bear meat.
NP2    Some Arctic explorers were forced to eat (raw polar bear meat).
MN2    Some Arctic explorers were forced to eat (raw) polar bear meat.
MN2    Some Arctic explorers were forced to eat raw (polar bear meat).
CN     Some Arctic explorers were forced to eat raw (polar bear) meat.
CN     Some Arctic explorers were forced to eat raw polar bear (meat).
MN3    Some Arctic explorers were forced to eat raw (polar) bear meat.
MN3    Some Arctic explorers were forced to eat raw polar (bear) meat.

*The parenthesized portion of each sentence is the portion of the
sentence deleted to form the question.

The second feature that should be noted is the fact 
that structural words (the class of words consisting prin- 
cipally of articles, prepositions, conjunctions, auxiliary 
and modal verbs, and infinitive markers) are never deleted 
as single words. In short, questions like little 

horse never occur in tests. When confronted 
with questions of this sort, people respond by trying to
find a lexical word (consisting roughly of verbs, nouns, 
adjectives, and adverbs) to fit the blank, and complain 
that the question does not really test their knowledge. 
Again, this appears to reflect a property of the language. 
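
The two features just described suggest a simple procedure for
generating completion items, sketched below. The tree is a
hypothetical analysis of The little boy rode the horse, not the
tree of Figure 1, and the node labels, the short list of structural
words, and the function names are all assumptions made for
illustration. Every node other than the root yields one item,
provided it dominates at least one lexical word.

```python
# A tiny illustrative list of structural words, not a complete one.
STRUCTURAL = {"the", "a", "an", "to", "of", "was", "were"}


def leaves(node):
    """Return the words dominated by a node, in sentence order."""
    _label, body = node
    if isinstance(body, str):
        return [body]
    return [word for child in body for word in leaves(child)]


def completion_items(tree):
    """List (node label, question, deleted words) for each usable node."""
    words = leaves(tree)
    items = []

    def walk(node, start, is_root):
        label, body = node
        span = leaves(node)
        end = start + len(span)
        has_lexical = any(w.lower() not in STRUCTURAL for w in span)
        if not is_root and has_lexical:
            question = words[:start] + ["______"] + words[end:]
            items.append((label, " ".join(question), " ".join(span)))
        if not isinstance(body, str):
            position = start
            for child in body:
                walk(child, position, False)
                position += len(leaves(child))

    walk(tree, 0, True)
    return items


# Hypothetical phrase structure tree; labels are assumptions.
tree = ("S", [
    ("NP", [("Det", "The"), ("MN", [("Adj", "little"), ("N", "boy")])]),
    ("VP", [("V", "rode"), ("NP", [("Det", "the"), ("N", "horse")])]),
])
items = completion_items(tree)
```

Because each deletion is the full set of words under one node, no
item cuts across a phrase boundary, and no item deletes a
structural word standing alone.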

The wh- question is, in many respects, identical to the
completion question. A sentence is selected, a phrase or
word within the sentence is selected, that word or phrase is
deleted and replaced with a wh- phrase, and then (usually,
but not always) the word order of the sentence is rearranged
so that the sentence begins with the wh- phrase. Table 3
shows the wh- questions that can be written over the sentence
in Figure 1. The units deleted are either individual
lexical words or phrase units, as in the deletion questions.

A variation on the wh- question is sometimes observed
in tests. This consists of questions made by replacing a
lexical word or phrase with a wh- phrase and then neglecting
the step of rearranging the word order. This results in
questions like The little what rode the horse or The what rode
horse. Wh- questions of this sort are almost identical
to the completion question; the only distinction is the use
of a wh- phrase instead of a blank.

By now it should be evident that cloze and the conven-
tional completion and multiple choice items are similar in
that both are made by a deletion process. Moreover, it is 
possible to define a cloze procedure that would delete only 
the words, phrases, or clauses that can be deleted by conven- 
tional item writing procedures. 



Table 3

Wh- questions obtained by applying transformations to
the nodes of the sentence structure in Figure 1

Node   Question                                     Constituent Deleted

S      What happened to some Arctic explorers?      were forced to eat raw
                                                    polar bear meat.
       and
       Who were forced to eat raw polar bear        Some Arctic explorers.
       meat?

NP1    Which Arctic explorers were forced to eat    Some
       raw polar bear meat?

MN1*   Some of what kind of explorers were forced   Arctic
       to eat raw polar bear meat?

VP1    What were some Arctic explorers forced to    eat raw polar bear
       do?                                          meat

       (None. This node dominates only one
       lexical constituent.)

NP2    What were some Arctic explorers forced to    raw polar bear meat
       eat?

Inf    (None. This node dominates only one
       lexical constituent.)

MN2    What kind of polar bear meat were some       raw
       Arctic explorers forced to eat?

CN     What kind of raw meat were some Arctic       polar bear
       explorers forced to eat?

MN3    What kind of raw bear meat were some         polar
       Arctic explorers forced to eat?

*One occasionally encounters questions like "Some Arctic what were forced
to eat raw polar bear meat?" These questions have the effect of deleting
the noun. It was omitted here only because it sounds a bit awkward
to native speakers of English and because it does not follow the rule
of shifting the wh- phrase to the initial position in the sentence.

Comparison of Cloze and Conventional Tests

Methods of Selecting Test Items: The chief distinction
between cloze and conventional tests is in the methods used
to select the items to appear in the test. Cloze tests must
be made without the intervention of human judgment for the
selection of any particular item. Test writers are typically
obscure about how they select the particular items
they choose to write for an achievement test. When a rationale
is offered, it usually involves the writer's or some
expert group's subjective feeling about what they call the
"important knowledge." This subjective feeling of importance
is seldom, if ever, analyzed into a set of objective criteria.
Undoubtedly, it includes consideration of the logic
of the content area, the social utility of various portions
of the content, and some judgment about whether the item
is too difficult or easy for the students with whom the
test is to be used.

Some rigorous effort is made to develop a taxonomy of
instructional objectives. As it is practiced, this is a
rather naive gesture, for no algorithm is used for deriving
the objectives from the instruction. Hence, the process of
selecting the test items to be written usually represents
a combination of judgments of what the test maker thinks
ought to be taught plus what he anticipates will produce
items having good statistical properties. The selection of
items to be included in a test from among those items the
test maker actually wrote is, when done at all, based upon
item difficulty and item inter-correlation indices.

Criterion and Norm Reference Testing: Because of the
way in which cloze items are selected, the cloze procedure
seems ideally suited for making criterion referenced tests.
While traditional procedures of selecting items may be fairly
adequate where normative test information is sought, they
leave much to be desired where criterion test information is
being sought. First, judgments of what knowledge is important
are totally inappropriate when applied at the point of
test construction. Judgments of this sort are appropriately
made at the points of selecting content, forming the
instruction, and interpreting measures of the outcomes. Introducing
them into the test construction process creates the possibility
that the outcomes of major portions of instruction will
be ignored. Specifically, it is the function of criterion
measurement to measure whatever is taught, that is, whatever
appears in an unordered list of the content of instruction.



Second, traditional methods of measurement provide no 
objective method of defining the domain of knowledge taught. 
Since a taxonomy of content is not derived by any specifiable 
algorithm, there are no criteria by which to judge when it 
is complete, when it contains content not actually taught, 
or, for that matter, when it contains two statements of the 
same content.

The construct of the unordered list of instructional
statements is a version of a complete taxonomy which, if it
could be constructed for each instructional program, would
adequately represent the content of the program. But its
construction depends upon a knowledge of intersentence
syntax, a field of linguistic science that is, as yet, poorly
developed.

However, it may be possible to develop empirical pro- 
cedures for developing unordered lists. The value of such 
a procedure would be great, for the procedure would provide 
an operationally meaningful method of defining the test item 
population domain. The content domain would be represented 
by the sentences in the unordered list and the item popula- 
tion would be the items that could be written over the sen- 
tences appearing in, or derivable from the list. 
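
Once an item population has been enumerated in this way, drawing a
test from it can itself be made free of judgment. The sketch below
assumes a simple random sample without replacement; the function
name, the seed, and the small population of completion items are
hypothetical.

```python
import random


def sample_items(item_population, n_items, seed=0):
    """Draw n_items from the population by a fixed, reproducible rule."""
    if n_items > len(item_population):
        raise ValueError("sample cannot exceed the item population")
    rng = random.Random(seed)
    return rng.sample(item_population, n_items)


# Hypothetical population: (sentence number, completion item).
population = [
    (1, "Some ______ were forced to eat raw polar bear meat."),
    (1, "Some arctic explorers were forced to eat ______ polar bear meat."),
    (4, "______ are often infected with trichinosis."),
    (4, "Polar bears are often infected with ______."),
]
test_items = sample_items(population, 2)
```

Fixing the seed makes the selection reproducible by another test
maker working from the same list.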

Third, basing item selection upon judgments or actual
measurements of item difficulty is inappropriate for
criterion referenced tests. Not only does the use of this
criterion prevent the adequate sampling of the content, but it
could lead to absurdities. It is easy to imagine the use of
this criterion resulting in the construction of an achievement
test that measured only trivial and ill taught parts of
the content, if the instructional program for which the test
was made were constructed to give much practice on the most
important parts of the content. The use of inter-item
correlations is even more difficult to justify, for the
correlations are almost solely dependent upon the organization
of instruction and the systematic relationship among the
concepts in the content.

It should be clear from the discussions above that
there are many similarities and even identities between test
items made by the cloze procedure and tests made by
conventional methods. A cloze test is any test made by deleting
objectively definable language units from a passage. The
units may be words, phrases, or clauses, since those units
are objectively definable. The traditional wh- and
completion questions do just this and no more.

It should be equally clear that cloze items cannot be 
constructed in the same manner as items used in norm refer- 
enced tests. Subjective judgments and judgments based on 
item statistics enter into the procedure for making norm 
referenced tests and, by definition, such judgments are ex- 
cluded from the cloze procedure. Specifically, a cloze test 
is made by a prespecified set of rules for selecting the 
language units to be deleted. 

Cloze Procedure in Making Criterion Reference Tests:
It seems that the cloze procedure is best suited for use in
criterion referenced tests. If the content domain is defined
as the ordered set of all sentences used in the instruction,
it follows that the item population is the set of all
nodes dominating at least one lexical constituent. (A lexical
constituent is a word or a phrase consisting of at least one
lexical word.)

[...] quickly to be very costly in terms of scoring time. Another
alternative would be to present the item in a multiple choice
format, offering among the alternatives the words deleted.
The other alternatives might be selected from among the most
frequent responses when the test is given in a constructed
response format.
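
One way this selection of alternatives might be carried out is
sketched below. The function name, the number of distractors, and
the responses are hypothetical, and ignoring letter case is an
assumption.

```python
from collections import Counter


def multiple_choice_alternatives(deleted, constructed_responses, n_distractors=3):
    """Return the deleted word followed by the most frequent wrong responses."""
    counts = Counter(r.strip().lower() for r in constructed_responses)
    counts.pop(deleted.lower(), None)  # the correct word is not a distractor
    distractors = [word for word, _ in counts.most_common(n_distractors)]
    return [deleted] + distractors


# Hypothetical constructed responses to the blank in
# "forced to eat ______ polar bear meat".
responses = [
    "raw", "cooked", "cooked", "frozen", "cooked",
    "frozen", "raw", "old", "frozen", "cooked",
]
alternatives = multiple_choice_alternatives("raw", responses, n_distractors=2)
```

In use, the order of the alternatives would presumably be shuffled
before the item is printed.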

By far, the most difficult problem arises from the fact
that the ordinal positions of sentences transmit information.
For example, boys get home first followed by They rode
horses implies that the riding of horses caused the boys to get
home first. When a sentence is taken out of context, the
information may be lost. Yet, that information is part of
the content. Until knowledge of intersentence syntax is
sufficiently advanced, it may be necessary to state this
information by subjective methods and place those statements
among the sentences sampled.

A final problem is the question,, of how to remove the 
effects (or suspected effects) of rote memory from cloze 
te-ts. When the student is presented with a cloze test over 
materials he has never read, his response can hardly be said 
to have resulted from rote memory; neither can it be said 
to represent the knowledge he achieved from having read the 
material. Conversely, when the test is made directly from 
the materials studied, it is impossible to exclude rote 
memory as a factor contributing to the responses. This is 
a problem with all tests in which the items are derived directly
from the materials. Consider the sentence "The boful wugs
daxed the morf." One can hardly be said to gain knowledge,
in any usual sense of the word knowledge, from reading
this sentence. Yet most people can answer the questions
"What kind of wugs daxed the morf?" and "The ___ wugs daxed
the morf."

This problem may be solved using two measures. First, 
before any sentence constituents have been selected for de- 
letion, the test maker might go through the sentences and 
replace randomly chosen constituents with semantically equivalent
words, phrases, or clauses. For example, if wugs were
defined as durfs who gleb moxes, we could derive the sentence
"The boful durfs who gleb moxes daxed the morf." The second
operation, and this should also be performed before constituents
are selected for deletion, is to perform one or more
transformations in each sentence so that the sentence retains
paraphrase equivalence with the original sentence but
no longer has the same syntactic structure. The example
sentence might become first "The morf was daxed by the boful
durfs who gleb moxes" and through subsequent transformations
it might become the two sentences "The durfs who gleb moxes
were boful. The morf was daxed by them." The items would
then be formed by deleting nodes from the sentences that re- 
sulted from these transformations and substitutions. 
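
The substitution operation can be sketched under a strong simplifying assumption: that the test maker has already listed, by hand, which words have equivalent phrases. The table EQUIVALENTS, the function substitute, and the probability p are hypothetical names introduced for illustration. The syntactic transformations are not sketched, since they require a grammar that is beyond a short example.

```python
import random
from typing import Dict, List

# Hypothetical, hand-written table of semantic equivalents.
EQUIVALENTS: Dict[str, str] = {"wugs": "durfs who gleb moxes"}


def substitute(words: List[str], equivalents: Dict[str, str],
               rng: random.Random, p: float = 0.5) -> List[str]:
    """Replace each word that has a listed equivalent with probability p."""
    out: List[str] = []
    for w in words:
        if w in equivalents and rng.random() < p:
            out.extend(equivalents[w].split())
        else:
            out.append(w)
    return out


original = "The boful wugs daxed the morf".split()
# With p=1.0 every listed word is replaced, so the result is deterministic.
print(" ".join(substitute(original, EQUIVALENTS, random.Random(0), p=1.0)))
```

Passing an explicit random.Random instance keeps the random choice of constituents reproducible, which matters if the same test must be regenerated later.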
Summary: A process based on the cloze procedure might
be used to make a criterion referenced test over an instruc- 
tional program in this manner. First, the test writer goes 
through the text generating sentences that make explicit 
the information contained in the sequential relationships 
between sentences. Second, samples of these sentences are 
drawn to provide tests of whatever number and size seems practically
and economically desirable. Third, the test maker
randomly selects lexical constituents and substitutes equiv- 
alent words or phrases for them. Fourth, the test writer 
performs one or more syntactic transformations on each 
sentence in such a way that paraphrase equivalence is preserved.
Fifth, he randomly selects from each sentence the node to be
deleted and forms the question. (By this time it should be
abundantly clear that it makes little difference whether he
chooses to write questions in a wh-question format or in a
deletion format.) Sixth, he gives the test to a group of
subjects in a constructed response format and selects the
distractor responses from among the highest frequency
incorrect responses. Seventh, he forms the items into a
multiple choice format, using the constituent deleted as the
correct response and the highest frequency incorrect
responses as the alternatives.
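
The fifth step, random selection of the node to be deleted, might be sketched as below in the deletion format. The sketch simplifies a node to a single word and treats any word outside a small, assumed list of function words as lexical. The list FUNCTION_WORDS and the function make_deletion_item are hypothetical and are not part of the procedure as stated.

```python
import random
from typing import Tuple

# Assumed function words; everything else is treated as lexical.
FUNCTION_WORDS = {"the", "a", "an", "of", "by", "who", "was", "were", "and", "them"}


def make_deletion_item(sentence: str, rng: random.Random,
                       blank: str = "_____") -> Tuple[str, str]:
    """Delete one randomly chosen lexical word; return (item, answer)."""
    tokens = sentence.rstrip(".").split()
    candidates = [i for i, t in enumerate(tokens)
                  if t.lower() not in FUNCTION_WORDS]
    if not candidates:
        raise ValueError("sentence has no lexical words to delete")
    i = rng.choice(candidates)
    answer = tokens[i]
    item = " ".join(tokens[:i] + [blank] + tokens[i + 1:]) + "."
    return item, answer


item, answer = make_deletion_item(
    "The morf was daxed by the boful durfs who gleb moxes.",
    random.Random(0),
)
print(item)
print("answer:", answer)
```

The answer returned here is what the seventh step would use as the correct response in the multiple choice form of the item.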



Concluding Remarks 

The remarkable thing about the cloze procedure is not
that it produces a new kind of test, for there is actually
a formal identity between conventional achievement test 
items and some of the items that can be made by the cloze 
procedure. Instead, the unique feature of the cloze pro- 
cedure is that it presents us with the algorithm for making 
criterion referenced tests over verbal instructional material.
In its every-fifth-word deletion form, the cloze procedure
provides us with a valid and highly reliable method of 
measuring the relative difficulties of instructional mate- 
rials for students. Certainly, this constitutes an impor- 
tant contribution to the evaluation of instructional pro- 
grams, for once the difficulties of instructional materials 
have been suitably processed, the cloze procedure provides 
an appropriate procedure for generating criterion reference 
test items from those materials. 
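
The every-fifth-word deletion form mentioned above is mechanical enough to sketch directly. The function name every_nth_cloze, the choice to start counting at the first word, and the fixed-length blank are assumptions made for illustration. The sample text is adapted from a sentence in this section, with punctuation left out to keep the word splitting simple.

```python
from typing import List, Tuple


def every_nth_cloze(text: str, n: int = 5,
                    blank: str = "_____") -> Tuple[str, List[str]]:
    """Delete the n-th, 2n-th, ... words and return (test, answers)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    answers: List[str] = []
    out: List[str] = []
    for position, word in enumerate(text.split(), start=1):
        if position % n == 0:
            answers.append(word)   # keep the deleted word for scoring
            out.append(blank)
        else:
            out.append(word)
    return " ".join(out), answers


TEXT = ("The cloze procedure provides us with a valid and highly reliable "
        "method of measuring the relative difficulties of instructional "
        "materials for students")

passage, answers = every_nth_cloze(TEXT)
print(passage)
print(answers)
```

Because the rule makes no reference to the content of the words, no subjective judgment enters into the choice of deletions.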

However, it must be clearly understood that the cloze 
procedure is not a panacea for the construction of achieve- 
ment tests. Where the object is to obtain highly efficient 
norm referenced tests, the cloze procedure is of value only 
for defining the population of possible items that can be 
written. Furthermore, all criterion referenced tests devel- 
oped to measure knowledge gained as a result of studying 
verbally presented content will be less than ideal for as 
long as there is no satisfactory way to deal with intersentence
syntax.